Abstract

This work focuses on representing very high-dimensional
global image descriptors using very compact 64-1024 bit
binary hashes for instance retrieval. We propose DeepHash:
a hashing scheme based on deep networks. Key to making
DeepHash work at extremely low bitrates are three important
considerations - regularization, depth and fine-tuning - each
requiring solutions specific to the hashing problem. In-depth
evaluation shows that our scheme consistently outperforms
state-of-the-art methods across all data sets for both Fisher
Vectors and Deep Convolutional Neural Network features, by
up to 20% over other schemes. The retrieval performance with
256-bit hashes is close to that of the uncompressed floating
point features - a remarkable 512x compression.

1. Introduction 

A compact binary image representation such as a 64-bit
hash is a definite must for fast image retrieval. 64 bits provide
more than enough capacity for any practical purposes,
including internet-scale problems. In addition, a 64-bit hash
is directly addressable in RAM and enables fast matching
using Hamming distances.
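As a minimal sketch of why binary hashes match quickly, the snippet below computes a Hamming distance with one XOR and a bit count, then ranks a small database by it. The function names `hamming64` and `nearest` are illustrative assumptions, not part of DeepHash.

```python
def hamming64(a: int, b: int) -> int:
    """Hamming distance between two hashes stored as 64-bit integers."""
    # XOR leaves a 1 wherever the two hashes disagree; mask to 64 bits.
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")


def nearest(query: int, database: list, k: int = 1) -> list:
    """Indices of the k database hashes closest to the query."""
    order = sorted(range(len(database)), key=lambda i: hamming64(query, database[i]))
    return order[:k]
```

A linear scan like this is only for illustration; the point is that each comparison costs a handful of integer operations.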

State-of-the-art global image descriptors such as Fisher
Vectors (FV) [1] and Deep Convolutional Neural Network
(DCNN) features [2, 3] allow for robust image matching.
However, the dimensionality of such descriptors is typically
very high: 8192 to 65536 floating point numbers for FVs [1]
and 4096 for DCNNs [2]. Bringing such high-dimensional
floating point representations down to a 64-bit hash is a
considerable challenge.

Deep learning has achieved remarkable success in many
visual tasks such as image classification [2, 4], image
retrieval [3], face recognition [5, 6] and pose estimation [7].
Furthermore, specific architectures such as stacked
restricted Boltzmann machines (RBM) are primarily known
as powerful dimensionality reduction techniques [8].

We propose DeepHash, a deep binary hashing scheme
that combines purpose-specific regularization with
weakly-supervised fine-tuning. A thorough empirical evaluation on
a number of publicly available data sets shows that DeepHash
consistently and significantly surpasses other
state-of-the-art methods at bitrates from 1024 down to 64. This
is due to the correct mix of regularization, depth and
fine-tuning. This work represents a strong step towards the Holy
Grail of a perfect 64-bit hash.

2. Related Work and Contributions 

Hashing schemes can be broadly categorized into unsupervised
and supervised (including semi-supervised) schemes.
Examples of unsupervised schemes are Iterative
Quantization [9], Spectral Hashing [10] and Restricted
Boltzmann Machines [8], while some examples of
state-of-the-art supervised schemes include Minimal Loss
Hashing [11], Kernel-based Supervised Hashing [12],
Ranking-based Supervised Hashing [13] and Column Generation
Hashing [14]. Supervised hashing schemes are typically
applied to the semantic retrieval problem. In this work, we are
focused on instance retrieval: semantic retrieval is outside
the scope of this work.

There is plenty of work on binary codes for descriptors
like SIFT or GIST [9, 15, 16, 17, 10, 12, 18, 11, 19, 20, 21].
There is comparatively little work on hashing descriptors
like Fisher Vectors (FV), which are two orders of magnitude
higher in dimensionality. Perronnin et al. [1] propose
ternary quantization of FV, quantizing each dimension to
+1, -1 or 0. Perronnin et al. also explore Locality Sensitive
Hashing [22] and Spectral Hashing [10]. Spectral Hashing
performs poorly at high rates, while LSH and simple ternary
quantization need thousands of bits to achieve good
performance. Gong et al. propose the popular Iterative
Quantization (ITQ) scheme and apply it to GIST [9]. In
subsequent work, Gong et al. [23] focus on generating very long
codes for global descriptors, and the Bilinear Projection-based
Binary Codes (BPBC) scheme requires tens of thousands
of bits to match the performance of the uncompressed
global descriptor. Jégou et al. propose Product Quantization
(PQ) for obtaining compact representations [24]. While this
produces compact descriptors, the resulting representation
is not binary and cannot be compared with Hamming
distances. As opposed to previous work, our focus is on
generating extremely compact binary representations for FV and
DCNN features in the 64 bits-1024 bits range.

Figure 1. Our proposed hashing and model training pipeline. A high-dimensional global image descriptor, such as a Fisher Vector or a deep
convolutional neural network feature, is extracted from an image. The trained DeepHash model transforms this image descriptor to a compact
binary hash (between 64 and 1K bits), via a succession of L nonlinear feedforward projections. The DeepHash model is trained in two
phases: an unsupervised pre-training phase and a weakly supervised fine-tuning phase. In phase 1, restricted Boltzmann machines (RBMs)
are trained in a layer-wise manner and stacked into a deep network. In phase 2, matching and non-matching pairs are used to construct
deep Siamese networks for parameter fine-tuning.

In this paper, we propose DeepHash (Figure 1), a hashing
scheme based on deep networks for high-dimensional
global descriptors. Key to making the DeepHash
scheme work at extremely low bitrates are three important
considerations - regularization, depth and fine-tuning
- each requiring solutions specific to the hashing problem.

• We pre-train a deep network using an RBM regularization
scheme that is specifically adapted to the hashing
problem. This enhances the efficiency of compact
hashes, while achieving performance close to the
uncompressed descriptor.

• Using stacked RBMs as a starting point, we fine-tune
the model as a deep Siamese network. Critical improvements
in the loss function lead to further improvements
in retrieval results.

• DeepHash training is only required to be performed
once on a single large independent training set. Through
a thorough evaluation against state-of-the-art hashing
schemes used for instance retrieval, we show that DeepHash
outperforms other schemes by a significant margin
of up to 20%, particularly at low bit rates. The results are
consistently outstanding across a wide range of data sets
and both DCNN and FV, showing the robustness of our
scheme.

3. DeepHash 

DeepHash is a hashing scheme based on a deep network
to generate binary compact hashes for image instance
retrieval (Figure 1).¹ Given a global image descriptor $\mathbf{z}^0$, a
deep network performs a series of $L$ layers of nonlinear
projections to generate a compact hash $\mathbf{z}^L$. The model is
trained in two phases: 1) greedy layer-wise unsupervised
pre-training with hashing regularization and 2)
weakly-supervised Siamese fine-tuning.

In the unsupervised phase, stacked restricted Boltzmann
machines (RBMs) [26] are used to learn the initial parameters
of the deep network. Each new layer in the network is
trained to model the data distribution of the previous layer
and is regularized specifically for hashing. A key feature
is that this unsupervised pre-trained model is easily
transferable. The unsupervised RBM parameters, which can
be used to generate good hashes, can be further optimized
with a fine-tuning phase. Fine-tuning is done through weak
supervision by treating the deep model as a Siamese
network [27]. Fine-tuning is also carried out on an independent
data set. In the rest of this section, we will describe the
details of the training process for our deep hashing scheme.

¹ DeepHash will be made publicly available on Caffe Model Zoo [25]
(https://github.com/BVLC/caffe/wiki/Model-Zoo).

3.1. Stacked Regularized RBMs

The deep network with $L$ layers is initially pre-trained
layer-by-layer from the bottom up through unsupervised
learning, where each pair of successive layers ($\mathbf{z}^{l-1}$ and $\mathbf{z}^l$)
is trained as an RBM building block. An RBM is a bipartite
Markov random field with the input layer $\mathbf{z}^{l-1} \in \mathbb{R}^I$
connected to a latent layer $\mathbf{z}^l \in \mathbb{R}^J$ via a set of undirected
weights $\mathbf{W}^l \in \mathbb{R}^{I \times J}$. The input units $z_i^{l-1}$ and latent units
$z_j^l$ are also parameterised by their corresponding biases $c_i^{l-1}$
and $b_j^l$, respectively.

Binary RBMs. The first layer of the deep network takes 
a high-dimensional image descriptor as input. Previous 
works [1, 28] have shown that binarization of FV and 
DCNN features results in negligible loss in performance. 
For this work, binarization is done by component-wise 
mean thresholding for the inputs. We use binary latent units 
with sigmoid activation function, because binary output bits 
are desired for our hash. Binary RBMs are also faster and 
simpler to train as compared to continuous RBMs [29]. All 
layers in the deep network will consist of binary units and 
binary hashes can be extracted from all intermediate layers. 
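The component-wise mean thresholding of the inputs can be sketched as follows. This is an illustration under the assumption that the mean is estimated from a set of descriptors and then reused; `binarize_by_mean` is a hypothetical helper name.

```python
import numpy as np


def binarize_by_mean(descriptors, mean=None):
    """Binarize descriptors (N x D) by thresholding each component at its mean."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    if mean is None:
        # Per-dimension mean estimated from the descriptors themselves.
        mean = descriptors.mean(axis=0)
    bits = (descriptors > mean).astype(np.uint8)
    return bits, mean
```

Returning the mean lets the same threshold be applied to descriptors seen later.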

The units within a layer are conditionally independent
pairwise. Therefore, the activation probabilities of one layer
can be sampled by fixing the states of the other layer, and
using distributions given by logistic functions for binary
RBMs:

$$P(z_j^l \mid \mathbf{z}^{l-1}) = \frac{1}{1 + \exp(-\mathbf{w}_j^\top \mathbf{z}^{l-1} - b_j^l)}, \tag{1}$$

$$P(z_i^{l-1} \mid \mathbf{z}^l) = \frac{1}{1 + \exp(-\mathbf{w}_i \mathbf{z}^l - c_i^{l-1})}. \tag{2}$$

As a result, alternating Gibbs sampling can be performed
between the two layers. The sampled states are used to
update the parameters $\{\mathbf{W}^l, \mathbf{b}^l, \mathbf{c}^{l-1}\}$ through minibatch
gradient descent using the contrastive divergence
algorithm [30] to approximate the maximum likelihood of the
input distribution.
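A minimal sketch of one contrastive divergence update with a single Gibbs step (CD-1) is given below. It is a generic binary-RBM update under stated assumptions, not the authors' training code: momentum and the hashing regularizer are omitted, and the names `cd1_step`, `v0`, `W`, `b` and `c` are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_step(v0, W, b, c, lr=0.001, rng=None):
    """One CD-1 update. v0: minibatch (N x I), W: (I x J), b: latent bias (J), c: input bias (I)."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Upward pass: latent activation probabilities, as in Equation (1).
    h0_prob = sigmoid(v0 @ W + b)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)
    # Downward pass: reconstruct the input layer, as in Equation (2).
    v1_prob = sigmoid(h0 @ W.T + c)
    h1_prob = sigmoid(v1_prob @ W + b)
    n = v0.shape[0]
    # Difference between data-driven and reconstruction-driven statistics.
    W = W + lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    b = b + lr * (h0_prob - h1_prob).mean(axis=0)
    c = c + lr * (v0 - v1_prob).mean(axis=0)
    return W, b, c
```

In practice the step would be repeated over many minibatches and epochs.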

Given a trained RBM with fixed parameters and an input
vector, a hash can be generated through a feedforward
projection and thresholding Equation (1) at 0.5:

$$z_j^l = \begin{cases} 1, & \text{if } P(z_j^l \mid \mathbf{z}^{l-1}) > 0.5 \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
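The feedforward hashing step can be sketched in a few lines. This assumes the weights `W` (I x J) and latent biases `b` (J) of a trained RBM are available as arrays; `rbm_hash` is a hypothetical name.

```python
import numpy as np


def rbm_hash(z_prev, W, b):
    """Hash bits for inputs z_prev (N x I): threshold the logistic activation at 0.5."""
    prob = 1.0 / (1.0 + np.exp(-(z_prev @ W + b)))
    return (prob > 0.5).astype(np.uint8)
```

Since the logistic function crosses 0.5 at zero, this is equivalent to testing whether the linear projection plus bias is positive.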

Hashing Regularization. The unsupervised RBM is
naively trained without considering the task, which in this
case is image hashing. It is, however, important for the
RBMs to project the data in a latent subspace that is
suitable for hashing. One way to encourage the learning of
suitable representations is to perform regularization, such as
sparsity [31, 32, 33]. For classification, representations are
encouraged to be very sparse to improve separability. For
hashing, however, it is desirable to encourage the
representation to make efficient use of the limited latent subspace.

For a given $l$ and a minibatch of input instances $\mathbf{z}^{l-1}_\alpha$, we
add a regularization term to the RBM optimization problem
to encourage (a) half the bits to be active for a given hash,
and (b) each bit value to be equiprobable across hashes:

$$\underset{\{\mathbf{W}^l, \mathbf{b}^l, \mathbf{c}^{l-1}\}}{\arg\min} \; -\sum_{\alpha} \log \Big( \sum_{\mathbf{z}^l_\alpha \in \mathcal{E}_\alpha} P(\mathbf{z}^{l-1}_\alpha, \mathbf{z}^l_\alpha) \Big) + \lambda h(\mathcal{E}_\alpha), \tag{4}$$

where $\mathcal{E}_\alpha$ is the minibatch of sampled latent units for layer
$l$ and $\lambda$ is the regularization constant.

We adapt the fine-grained regularization proposed in [33]
to suit our hashing problem. For each instance $\mathbf{z}^l_\alpha$, the
regularization term for binary units penalises each unit with
the cross entropy loss with respect to a target activation $t^l_{j\alpha}$
based on a predefined distribution,

$$h(\mathcal{E}_\alpha) = -\sum_{\alpha} \sum_{j} \Big[ t^l_{j\alpha} \log z^l_{j\alpha} + (1 - t^l_{j\alpha}) \log(1 - z^l_{j\alpha}) \Big]. \tag{5}$$

Unlike [33], we choose the $t^l_{j\alpha}$ such that each $\{t^l_{j\alpha}\}_j$ for
fixed $\alpha$ and each $\{t^l_{j\alpha}\}_\alpha$ for fixed $j$ is distributed according
to $U(0, 1)$. The uniform distribution is suitable for hashing
high-dimensional vectors because the regularizer
encourages each latent unit to be active with a mean of 0.5,
while avoiding activation saturation. The result is a
space-filling effect in the latent subspace, where data is efficiently
represented.
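The cross entropy penalty between latent activations and target activations can be sketched as below. This is only an illustration: the targets `t` are passed in as an array (how targets are assigned to individual units is not shown here), the clipping constant `eps` is an added numerical safeguard, and `hashing_penalty` is a hypothetical name.

```python
import numpy as np


def hashing_penalty(z, t, eps=1e-12):
    """Cross entropy between activations z and targets t, both (N x J) with values in [0, 1]."""
    # Clip to avoid log(0) for saturated units (an assumption, not from the paper).
    z = np.clip(np.asarray(z, dtype=np.float64), eps, 1.0 - eps)
    t = np.asarray(t, dtype=np.float64)
    return float(-np.sum(t * np.log(z) + (1.0 - t) * np.log(1.0 - z)))
```

For a target of 0.5, an activation of 0.5 gives the smallest penalty and saturated activations are penalised more heavily, which matches the stated aim of avoiding activation saturation.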

After RBM training, we further enforce space utilization
by substituting the learned RBM bias by the data set
mean $\langle \mathbf{w}_j^\top \mathbf{z}^{l-1} \rangle$ of the linear projection preceding the
logistic. Equation (3) is modified such that the final hash is
centered around 0.5:

$$z_j^l = \begin{cases} 1, & \text{if } \mathbf{w}_j^\top \mathbf{z}^{l-1} - \langle \mathbf{w}_j^\top \mathbf{z}^{l-1} \rangle > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{6}$$
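A sketch of this mean-centred thresholding follows, assuming the projection mean is estimated once on training inputs and then reused at hashing time. The names `projection_mean` and `centered_hash` are illustrative.

```python
import numpy as np


def projection_mean(train_z_prev, W):
    """Per-bit mean of the linear projection over a training set (N x I)."""
    return (np.asarray(train_z_prev, dtype=np.float64) @ W).mean(axis=0)


def centered_hash(z_prev, W, mean_proj):
    """Hash bits obtained by thresholding the projection at its training-set mean."""
    proj = np.asarray(z_prev, dtype=np.float64) @ W
    return (proj - mean_proj > 0).astype(np.uint8)
```

Thresholding at the mean rather than at a learned bias pushes each bit towards being active for roughly half of the data.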

Stacked RBMs. The set of global image descriptors lies
on a complex manifold in a very high-dimensional feature
space. Deeper networks have the potential to discover more
complex nonlinear hash functions and improve image
instance retrieval performance. Following [26], we stack
multiple RBMs by training one layer at a time to create a deep
network with several layers.

Figure 2. A sample point (black dot) with corresponding matching (red
dots) and non-matching (blue dots) samples. The contrastive loss used for
fine-tuning can be interpreted as applying attractive forces between
matching elements (red arrows) and repulsive forces between non-matching
elements (blue arrows). (a) The loss function (7) proposed in [34] with a
single margin parameter for non-matching pairs (blue circle). Matching
elements are subject to attractive forces regardless of whether they are
already close enough to each other, which adversely affects fine-tuning.
(b) Our proposed loss function (8) with an additional margin parameter
affecting matching pairs reciprocally (red circle).

Each layer models the activation distribution of the
previous layer and captures higher order correlations between
those units. For the hashing problem, we are interested in
low-rate points of 64, 256 and 1024 bits, which are typical
operating points as discussed in Section 4. We progressively
decrease the dimensionality of latent layers by a factor of
$2^n$ per layer, where $n$ is a tuneable parameter. For our final
models, $n$ is empirically selected for each layer, resulting in
variable network depth.
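The layer-size rule above can be made concrete with a small helper. This is a sketch: `layer_sizes` is a hypothetical function, and the exponents in the usage note are chosen only to reproduce the 8192-4096-2048-64 configuration mentioned in the figure captions.

```python
def layer_sizes(input_dim, exponents):
    """Layer widths obtained by dividing the previous width by 2**n for each n."""
    sizes = [input_dim]
    for n in exponents:
        factor = 2 ** n
        if sizes[-1] % factor != 0:
            raise ValueError("layer width is not divisible by 2**n")
        sizes.append(sizes[-1] // factor)
    return sizes
```

For example, `layer_sizes(8192, [1, 1, 5])` gives `[8192, 4096, 2048, 64]`, a three-layer network ending in a 64-bit hash.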

3.2. Deep Siamese Fine-Tuning 

Retrieval results are driven by the structure of the local
neighborhood around the query. The unsupervised training
is followed by a fine-tuning step in order to improve the
local structure of the embedding. The fine-tuning is
performed with a learning architecture known as Siamese
networks, first introduced in [27]. The principle was later
successfully applied to deep architectures for face identification
[35] and shown to produce representations robust to various
transformations in the input space [34]. The use of Siamese
architectures in the context of image retrieval from DCNN
features was recently suggested as a possible improvement
to the state-of-the-art on the subject [3].

A Siamese network is a weakly-supervised scheme for
learning a similarity measure from pairs of data instances
labeled as matching or non-matching. In our adaptation of
the concept, the weights of the trained RBM network are
fine-tuned by learning a similarity measure at every
intermediate layer in addition to the target space. Given a pair
of data $(\mathbf{z}^0_\alpha, \mathbf{z}^0_\beta)$, a contrastive loss $D_l$ is defined for every
layer $l$ and the error is back-propagated through gradient
descent. Back-propagation for the losses of individual layers
($l = 1..L$) is performed at the same time. Applying the loss
function proposed by Hadsell et al. in [34] yields:

$$D_l(\mathbf{z}^0_\alpha, \mathbf{z}^0_\beta) = y \, \|\mathbf{z}^l_\alpha - \mathbf{z}^l_\beta\|_2^2 + (1 - y) \max(m - \|\mathbf{z}^l_\alpha - \mathbf{z}^l_\beta\|_2^2, 0) \tag{7}$$

where $y = 1$ if $(\mathbf{z}^0_\alpha, \mathbf{z}^0_\beta)$ is a matching pair or $y = 0$
otherwise, and $m > 0$ is a margin parameter affecting
non-matching pairs. As shown in Figure 2(a), the effect is to
apply a contractive force between elements of any matching
pair and a repulsive force between elements of
non-matching pairs whose element-wise distance is shorter than
$\sqrt{m}$.

Figure 3. Recall @ R=10 on the Holidays data set (see Section 4.1 for a
description of the data sets) over several iterations of Siamese fine-tuning.
The recall rate quickly collapses when using the single margin loss
function suggested by Hadsell et al. [34], while performance is better retained
when only non-matching pairs are passed. The double-margin loss solves
the problem. The network is a stacked RBM (8192-4096-2048-64) trained
with Fisher descriptors on the ImageNet data set. Matching pairs are
sampled from the Yandex data set. For every matching pair, a random
non-matching element is chosen from the data set to form two non-matching
pairs. There are 33 matching pairs and 66 corresponding non-matching
pairs in every iteration. The test set is the Holidays data set.

However, experimental results in Figure 3 show that the
loss function (7) causes a quick drop in retrieval results.
Results with non-matching pairs alone suggest that the
handling of matching pairs is responsible for the drop. The
indefinite contraction of matching pairs well beyond what is
necessary to distinguish them from non-matching elements
is a damaging behaviour, especially in a fine-tuning context
since the network is first globally optimized with a
different objective. Figure 4 shows that any two elements, even
matching, are always far apart in high dimension. As a
solution, we propose a double-margin loss with an additional
parameter affecting matching pairs:

$$D_l(\mathbf{z}^0_\alpha, \mathbf{z}^0_\beta) = y \max(\|\mathbf{z}^l_\alpha - \mathbf{z}^l_\beta\|_2^2 - m_1, 0) + (1 - y) \max(m_2 - \|\mathbf{z}^l_\alpha - \mathbf{z}^l_\beta\|_2^2, 0) \tag{8}$$

As shown in Figure 2(b), the new loss can thus be
interpreted as learning “local large-margin classifiers” (if $m_1 < m_2$)
to distinguish between matching and non-matching
elements. In practice, we found that the two margin parameters
can be set equal ($m_1 = m_2 = m$) and tuned automatically
from the statistical distribution of the sampled matching and
non-matching pairs (Figure 4).
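The double-margin loss and the automatic margin selection can be sketched as follows. This is an illustration for a single layer and a single pair (or a batch of pairs), not the fine-tuning code itself; the margin rule follows the Figure 4 caption (mean of the two medians), and all function names are assumptions.

```python
import numpy as np


def sq_dist(za, zb):
    """Squared Euclidean distance between layer outputs, computed along the last axis."""
    diff = np.asarray(za, dtype=np.float64) - np.asarray(zb, dtype=np.float64)
    return np.sum(diff ** 2, axis=-1)


def double_margin_loss(za, zb, y, m1, m2):
    """y = 1 for matching pairs, y = 0 for non-matching pairs."""
    d = sq_dist(za, zb)
    pull = np.maximum(d - m1, 0.0)   # matching pairs are attracted only beyond m1
    push = np.maximum(m2 - d, 0.0)   # non-matching pairs are repelled only within m2
    return y * pull + (1 - y) * push


def shared_margin(match_dists, nonmatch_dists):
    """Shared margin m: mean of the median matching and non-matching squared distances."""
    return 0.5 * (np.median(match_dists) + np.median(nonmatch_dists))
```

With this form, a matching pair that is already closer than the margin contributes zero loss, so it is no longer contracted indefinitely.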

4. Experimental Results 
4.1. Evaluation Framework 

Figure 4. Histograms of squared Euclidean distances for 20,000 matching
pairs and corresponding 40,000 non-matching pairs for an 8192-4096(top)-
2048(middle)-64(bottom) stacked RBM network. The red and blue vertical
lines indicate the median values for the matching and non-matching pairs
respectively. The Siamese loss shared margin value m is systematically set
to be the mean of the two values (black vertical lines).

Figure 6. Hashing FV for Holidays. (HR) refers to schemes trained with
hashing regularization. (a) Hashing regularization improves performance
significantly for single layer models 8192-b as b is decreased. (b) Recall
improves as depth is increased for lower rate points b = 64 and b =
256. With regularization, we can achieve the same or better recall at lower
depth.

Global Descriptors. For the FV, we extract SIFT [36]
features obtained from Difference-of-Gaussian (DoG)
interest points. We use PCA to reduce dimensionality of
the SIFT descriptor from 128 to 64 dimensions, which has
been shown to improve performance [37]. We use a Gaussian
Mixture Model (GMM) with 128 centroids, resulting in
8192 dimensions each for first and second order statistics.
Only the first-order statistics are retained in the global
descriptor representation, as second-order statistics only
result in a small improvement in performance [38]. The FV
is L2-normalized to unit norm, after signed power
normalization. We denote this configuration as the FV feature from
here on.
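The final normalization of the FV can be sketched as below. The power exponent is not stated in this section, so `alpha=0.5` is an assumed default; `normalize_fv` is a hypothetical helper. The first-order FV has 128 centroids x 64 PCA dimensions = 8192 components.

```python
import numpy as np


def normalize_fv(fv, alpha=0.5):
    """Signed power normalization followed by L2 normalization to unit norm."""
    fv = np.asarray(fv, dtype=np.float64)
    # Signed power: keep the sign, compress the magnitude (alpha is an assumption).
    fv = np.sign(fv) * np.abs(fv) ** alpha
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```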

DCNN features are extracted using the open-source
software Caffe [25] with the 7-layer AlexNet proposed for
ImageNet classification in [2]. We
find that layer fc6 (before softmax) performs the best for
image retrieval, similar to the recently reported results in [3].
We refer to this feature as the DCNN feature from here on.

Training Data. Most schemes, including our proposed
scheme, require a training step. We use the ImageNet data
set for training, which consists of 1 million images from
1000 different image categories [39]. We randomly sample
a subset of images from ImageNet. For the proposed deep
Siamese fine-tuning scheme, we use the 200K
matching image pairs data set provided by Yandex in their
recent work [3], consisting primarily of landmark images.
For every matching pair, a random sample is picked to
generate 2 corresponding non-matching pairs. This training
set is independent of the query and database data described
next.

Testing Data. We use 4 popular data sets for small
scale experiments: Oxford (55 queries, 5062 database
images) [40], INRIA Holidays (500 queries, 991 database
images) [41], Stanford Mobile Visual Search Graphics (1500
queries, 1000 database images) [42, 43] and University
of Kentucky Benchmark (UKB) (10200 queries, 10200
database images) [44]. For large-scale retrieval
experiments, we present results on Holidays and UKB data sets,
combined with the 1 million MIR-FLICKR distractor data
set [45].

Comparisons. We compare several state-of-the-art
schemes. Some have been proposed for lower dimensional
vectors like SIFT and GIST, but we evaluate their
performance on both FV and DCNN features.

• ITQ [9]. For the Iterative Quantization (ITQ) scheme,
the authors propose signed binarization after applying
two transforms: first the PCA matrix, followed by a
rotation matrix, which minimizes the quantization error
of mapping PCA-transformed data to the vertices of a
zero-centered binary hypercube.

• BPBC [23]. Instead of the large PCA projection
matrices used in [9], the authors apply bilinear projections,
which require far less memory.

• LSH [22]. LSH is based on random unit-norm
projections followed by signed binarization.

• PQ [24]. For FV and Product Quantization, we
consider blocks of dimensions D = 64, 256 and 1024, and
train K = 256 centroids for each block, resulting in
b = 64, 256 and 1024 bit descriptors respectively. For
DCNN, we consider blocks of dimensions D = 32, 128
and 1024, with K = 256 centroids, resulting in the
same bitrates. Here, we do not apply Random
Rotations or PCA before applying PQ [24]. Such
preprocessing can be applied to other schemes too. This is not
a binary hashing scheme and is only included for
reference.
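Of the baselines listed above, LSH is the simplest to sketch: random unit-norm projections followed by sign binarization. The snippet is a generic illustration with Gaussian random directions (an assumption about how the projections are drawn), and the function names are hypothetical.

```python
import numpy as np


def lsh_projections(dim, bits, seed=0):
    """Random projection matrix (dim x bits) with unit-norm columns."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((dim, bits))
    return P / np.linalg.norm(P, axis=0, keepdims=True)


def lsh_hash(x, P):
    """One bit per projection: 1 if the projection is positive, else 0."""
    return (np.asarray(x, dtype=np.float64) @ P > 0).astype(np.uint8)
```

Because the projections are data-independent, no training is needed, which is consistent with LSH requiring many bits before it becomes competitive.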

We ignore Spectral Hashing [10] due to its inferior
performance on FV in [1].

4.2. DeepHash Experiments 

Hashing Regularization. In Figure 6(a), we show the
effect of applying regularization proposed in Section 3.1 on a
single layer RBM 8192-b, for b = 64, 256, 1024. The
Holidays data set and FV features are chosen. Hashing
regularization improves performance significantly: by ~10% absolute
recall @ R = 10 at the low-rate point b = 64. The performance
gap increases as rate decreases. This is intuitive as the
regularization pushes the network towards keeping half the bits
alive and equiprobable (across hashes), with its effect being
more pronounced at lower rates.

Figure 5. Comparing AUC, Recall and MAP performance of different schemes at varying b in (a), (b) and (c) respectively. Holidays and FV are used for
retrieval experiments, and SMVS for AUC. DeepHash outperforms all schemes. Also, the performance ordering of schemes is largely consistent between
AUC results and retrieval results, both MAP and Recall. AUC can be used for fast optimization of parameters.

Depth. In Figure 6(b), we plot recall @ R = 10 for the
Holidays data set and FV features, as depth is increased for
a given rate point b. For b = 1024, we consider
configurations 8192-1024, 8192-4096-1024, and 8192-4096-2048-1024,
corresponding to depth 1, 2, 3 respectively. For rate
points b = 64 and 256, similar configurations of varying
depth are chosen. We observe that, with no regularization,
recall improves as depth is increased for b = 256 and
b = 64, with optimal depth of 3 and 4 respectively, beyond
which performance drops. At higher rates of b = 1024 and
beyond, increasing depth does not help, as performance
saturates. For hashing, a sweet spot in performance for the
depth parameter is observed for each rate point, as deeper
networks can cause performance to drop due to loss of
information over the layers. Similar trends are obtained for
recall @ R = 100. Importantly, we observe that with the
proposed regularization, we can achieve the same
performance with lower depth at each rate point. This is critical,
as the lower the depth, the faster the hash generation and the
lower the memory requirements.

Fine-Tuning. Table 1 provides detailed retrieval results
for a 3-layer model before and after Siamese fine-tuning.
The results show consistent improvements with every
training data set and at any bit-rate, with a global average
difference of 2.78% (up to 6.24%). The absolute difference is
larger at higher R, with an average of 3.13% @ R=100
compared to 2.43% @ R=10. They are however
quite comparable when the relative improvement rate is
considered: 7.46% @ R=10 and 7.24% @ R=100.

Table 1. Retrieval results before and after Siamese fine-tuning, with
corresponding differences. The stacked RBM network (8192-4096-2048-64)
is trained with Fisher descriptors from the ImageNet data set. Fine-tuning
consistently improves retrieval results at any bit-rate.

We notice differences across test sets, with improvements
on the Oxford set being more pronounced. The Yandex data
set used for fine-tuning is made with matching pairs of
landmark structures, which can explain the better performance of
the Oxford data set made of buildings only. The systematic
improvements on all sets are nevertheless evidence of the
high transferability of both unsupervised training and
semi-supervised fine-tuning.

Table 2. DeepHash Architecture

Fast Optimization with ROC Experiments. The
Stanford Mobile Visual Search (SMVS) data set [42] contains
a list of 16,319 matching image pairs, comprising a wide
range of object categories. We extract FV for matching and
non-matching pairs from the SMVS data set, hash the data
to different rate points, and compute the Receiver
Operating Characteristic (ROC) curve. In Figure 5(a), we plot
ROC Area Under Curve (AUC) for different schemes for
the SMVS data set. In Figure 5(b) and 5(c), we plot
recall @10 and MAP on the Holidays data set. The retrieval
performance of a scheme at a given database size depends
on the ROC curve at different TPR/FPR operating points,
as shown in [46]. Low FPR points are important [46, 47].
We observe that the Area Under Curve (AUC) results
predict well the performance ordering (MAP and Recall) of
different schemes for retrieval experiments. The retrieval
and AUC experiments are performed on very different data
sets, but the AUC results generalize well, and are used for
fast optimization of parameters.
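As a sketch of the AUC computation on pair distances, the function below uses the rank-statistic form of the ROC area: the probability that a random matching pair has a smaller Hamming distance than a random non-matching pair, with ties counted as one half. This is a generic formulation, not necessarily the exact procedure used in the experiments; `auc_from_distances` is a hypothetical name.

```python
import numpy as np


def auc_from_distances(match_dists, nonmatch_dists):
    """ROC AUC for separating matching from non-matching pairs by distance."""
    m = np.asarray(match_dists, dtype=np.float64)[:, None]
    n = np.asarray(nonmatch_dists, dtype=np.float64)[None, :]
    # Compare every matching pair against every non-matching pair.
    wins = (m < n).sum() + 0.5 * (m == n).sum()
    return float(wins / (m.size * n.size))
```

The pairwise comparison is quadratic in the number of pairs, so a large pair list would call for a sort-based variant.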

DeepHash Parameters. For RBM learning, we set the
learning rate to 0.001 for the weight and bias parameters,
momentum to 0.9, and ran the training on 150,000 images
from the ImageNet data set for a maximum of 30 epochs.
Binary descriptors for the first layer are generated by
subtracting the mean for each data set. For each rate point,
we consider a set of models with dimensionality
progressively reduced by a factor of 2 from the starting
representation for FV and DCNN respectively. The best model is
chosen based on greedy optimization of AUC on the SMVS data
set, which works well as seen in detailed experimental
results of Section 4.3. The chosen architectures are described
in Table 2. The depth of the network increases as hash size
decreases. Each target setting requires several hours to train
on a modern CPU.

4.3. Retrieval Experiments 

We present retrieval results using FV and DCNN
features in Figure 7 and Figure 8. For instance retrieval, it is
important for the relevant image to be present in the first
step of the pipeline, matching global descriptors, so that a
Geometric Consistency Check (GCC) [48] step can find it
subsequently. We present recall @ typical operating points,
R = 100 and R = 1000 for small and large data sets
respectively. For UKB small experiments, we plot 4x recall @
R = 4 to be consistent with the literature. We refer to
hashes before and after fine-tuning as DeepHash and
DeepHash+ respectively in all figures. We refer to deep hashes
based on DCNN and FV features as DCNN-DeepHash and
FV-DeepHash respectively.
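The recall @ R metric can be sketched as follows. The exact definition is not spelled out in this section, so the snippet assumes per-query recall is the fraction of relevant database images found in the top R results, averaged over queries; under that assumption, 4x recall @ R = 4 on UKB is the average number of relevant images among the first four results. Function names are illustrative.

```python
def recall_at_r(ranked_ids, relevant_ids, r):
    """Fraction of the relevant images that appear in the top r results of one query."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant images")
    hits = sum(1 for image_id in ranked_ids[:r] if image_id in relevant)
    return hits / len(relevant)


def mean_recall_at_r(rankings, ground_truth, r):
    """Average recall @ r over all queries."""
    scores = [recall_at_r(rk, gt, r) for rk, gt in zip(rankings, ground_truth)]
    return sum(scores) / len(scores)
```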

Performance of DeepHash. For DCNN and FV
features, the proposed DeepHash outperforms state-of-the-art
schemes by a significant margin on all data sets. The
statistics of FV and DCNN features are very different. FV are
dense descriptors with zero blocks corresponding to
centroids not visited, while deep DCNN features tend to be
sparse. Our method works well for both types of features.

For the retrieval experiments in Figure 7, there is up to
20% improvement in absolute recall at b = 64 bits
compared to the second-best performing scheme. Up to 15%
improvement is seen at b = 256, which can be a practical
rate point for many applications, as there is only a marginal
drop in performance for DCNN features compared to
uncompressed features. Similar trends are obtained for recall
@ R = 10 and MAP, as seen by comparing Holidays
results in Figure 5(b), (c) and Figure 7(a), with a higher gap
for larger R. Consistent trends are also obtained for the
large-scale retrieval results in Figure 8.

Figure 7. Small-scale retrieval results. DeepHash outperforms other
schemes by a significant margin.

The performance ordering of other schemes depends on
the bitrate and type of feature, while DeepHash is consistent
across data sets. Compared to the ITQ scheme, which applies a
single PCA transform, each output bit for DeepHash is
generated by a series of projections. The PQ scheme performs
poorly at the low rates in consideration, as large blocks of
the global descriptor are quantized with a small number of
centroids, as previously observed in [23]. LSH performs
poorly at low rates, but catches up given enough bits.

We observe a consistent improvement using Siamese
fine-tuning, which learns more discriminative projections.
The learnt projections generalize well, which is key for
diverse retrieval tasks, thus showing the robustness of our
proposed method.

Comparing FV-DeepHash and DCNN-DeepHash. At
a given rate point, DCNN-DeepHash outperforms
FV-DeepHash hashes for all data sets except Graphics, as seen
by comparing across rows of Figure 7. At low rates,
DCNN-DeepHash improves performance by more than 10% on
some of the small data sets, while the gap increases up to
20% for the 1M experiments.

The results are data-set dependent. DCNN features are
able to capture more complex low level features and have
a lower starting dimensionality compared to FV. However,
DCNN features have limited rotation and scale invariance,
based on the level of data invariance seen at training time.
FV, on the other hand, aggregate hand-crafted SIFT
descriptors from scale and rotation invariant interest points, which
results in a scale and rotation invariant representation. The
Graphics data set has more objects with large variations in
scale and rotation compared to the other data sets: this is
one of the reasons why peak performance of FV is higher
than DCNN for Graphics.

Comparison to Uncompressed Descriptors. We
compare the performance of DeepHash to the uncompressed
descriptor in Figure 7. We obtain remarkable compression
performance - at 256 bits for DCNN hashes, we only
observe a marginal drop (a few %) compared to the
uncompressed representation for retrieval on a wide range of data
sets: a 512x compression compared to a floating point
representation, and 16x compared to a binary representation.
For FV, we can match the performance of the uncompressed
descriptor with 1024 bits for Holidays and UKB, with a drop
for Graphics and Oxford. At some rate points, DeepHash
performs better than the uncompressed descriptor, which is
due to quantization of noise in the uncompressed descriptor.
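The compression factors quoted above can be checked with simple arithmetic. The sketch assumes the 4096-dimensional DCNN descriptor stated in the introduction and 32-bit floats (the float width is an assumption, chosen because it reproduces the 512x figure); `compression_ratio` is a hypothetical helper.

```python
def compression_ratio(dims, bits_per_dim, hash_bits):
    """Size of the original descriptor divided by the size of the hash, both in bits."""
    return dims * bits_per_dim / hash_bits


# 4096 floats at 32 bits each versus a 256-bit hash.
float_ratio = compression_ratio(4096, 32, 256)
# 4096 binarized dimensions (1 bit each) versus a 256-bit hash.
binary_ratio = compression_ratio(4096, 1, 256)
```

Here `float_ratio` evaluates to 512 and `binary_ratio` to 16, matching the two factors in the text.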

The instance retrieval hashing problem becomes
increasingly difficult as we move towards a 64-bit hash. At 64 bits,
there is a 5-10% drop in performance compared to 256 bits
for DCNN features, while a drop is also observed for FV.
For the million scale experiments, however, we observe a
10-20% drop in performance at 64 bits compared to 1024
bits for DCNN features.

Figure 8. Large-scale retrieval results (with 1M distractor images) for
different compression schemes. DeepHash outperforms other schemes at
most rate points and data sets.

Future Work. Improving performance further at very
low-rate points like 64 bits for even larger databases is an
interesting direction for future work. Studying mathematical
models which relate hash size to performance for varying
database size is also an exciting direction to pursue.
Finally, in this work, we learnt compact hashes starting from a
pre-trained DCNN model. Learning the hash directly from
pixels in a DCNN framework might lead to further
improvements.

5. Conclusion 

A perfect image hashing scheme would convert a
high-dimensional descriptor into a low-dimensional bit
representation without losing retrieval performance. We believe that
DeepHash, which focuses on achieving complex hash
functions with deep learning, is a significant step in this
direction. Our method is focused on a deep network which
efficiently utilizes the binary subspace through hashing
regularization and further fine-tuning using a Siamese
training algorithm. Through a rigorous evaluation process, we
show that our model performs well across various data sets,
regardless of the type of image descriptors used, sparse
or dense. The marked improvement over existing hashing
schemes attests to the importance of regularization, depth
and fine-tuning for hashing image descriptors.